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ABSTRACT 

The BloodChIP database (http://www.med.unsw. 
edu.au/CRCWeb.nsf/page/BloodChlP) supports 
exploration and visualization of combinatorial tran- 
scription factor (TF) binding at a particular locus in 
human CD34-positive and other normal and leu- 
kaemic cells or retrieval of target gene sets for 
user-defined combinations of TFs across one or 
more cell types. Increasing numbers of genome- 
wide TF binding profiles are being added to public 
repositories, and this trend is likely to continue. For 
the power of these data sets to be fully harnessed 
by experimental scientists, there is a need for these 
data to be placed in context and easily accessible 
for downstream applications. To this end, we have 
built a user-friendly database that has at its core the 
genome-wide binding profiles of seven key haem- 
atopoietic TFs in human stem/progenitor cells. 
These binding profiles are compared with binding 
profiles in normal differentiated and leukaemic 
cells. We have integrated these TF binding profiles 
with chromatin marks and expression data in 
normal and leukaemic cell fractions. All queries 
can be exported into external sites to construct 
TF-gene and protein-protein networks and to 
evaluate the association of genes with cellular 
processes and tissue expression. 

INTRODUCTION 

Transcription factors (TFs) and the cLv-regulatory se- 
quences to which they bind form the building blocks of 
gene regulatory networks that govern gene expression and 
give cells their unique identity (1). Stem cells have the 
capacity to regenerate themselves and also to give rise to 



progeny with increasingly specialized function and re- 
stricted proHferative capacity (2). Haematopoiesis is one 
of the best described developmental systems in vertebrates 
and serves as a model system for the multi-lineage differ- 
entiation of adult stem cells. However, little is known 
about the components and hierarchy of the gene regula- 
tory network that controls the identity of human adult 
haematopoietic stem cells (HSCs). Understanding the 
ground state of the transcriptional network in HSCs is 
the first step to better understanding how these networks 
change as HSCs respond to external cues and proliferate 
or differentiate. Stem cell signatures are also corrupted 
and persist in leukaemic cells and confer an adverse prog- 
nosis (3,4). Knowledge of the components and hierarchy 
of the normal HSC transcriptional network may provide 
clues to dismanthng stem cell networks in leukaemic cells. 

TFs often act in 3-D space as components of multi- 
protein complexes that bind distal regulatory elements 
and interact with promoters to initiate, augment or 
suppress gene expression in conjunction with a plethora 
of chromatin modifiers and transcription initiation/ 
elongation factors (5). In the post-genomics era, classical 
methods to screen for potential gene-regulatory regions, 
such as DNasel hypersensitivity mapping by Southern 
blotting of selected loci, have been superseded by 
new genome-wide techniques such as Chromatin 
Immunoprecipitation (ChIP) and DNasel hypersensitivity 
assays coupled to high throughput sequencing (ChlP-seq 
and DNasel-seq, respectively) (6,7). These large-scale 
screening techniques help rapidly identify the positions 
of a large number of candidate regulatory elements and 
genome-wide binding patterns for multiple TFs. 

A core set of seven TFs— FLH, ERG, GATA2, 
RUNXl, SCL/TALl, LYLl and LM02— work in com- 
bination to regulate gene expression in haematopoietic 
stem/progenitor cells (HSPCs) (8). This heptad of TFs 
also contributes to stem cell-like signatures in leukaemic 
cells, and their presence impacts adversely on patient 
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survival (4). Knowledge of how these TFs cooperate with 
each other and interact with other lineage-specific TFs to 
regulate gene expression during HSPC differentiation is 
vital to improving our understanding of normal blood 
development. To this end, we have recently generated 
high-quality ChlP-seq data for these seven haematopoietic 
TFs in primary human CD34^ blood stem and progenitor 
cells (9). To optimize the utility of these data, we 
have integrated our ChlP-seq and RNA-seq data sets 
with chromatin accessibihty marks from the Human 
Epigenome Atlas (6) and Encyclopaedia of DNA 
Elements (ENCODE) (10) as well as published ChlP-seq 
and expression data from normal (11-13) and leukaemic 
stem cell fractions (14). All analysed and filtered data sets 
can be exported for downstream analysis. To facilitate this 
process, we have also created links to external resources 
such as the UCSC genome browser to visualize data, 
Cytoscape (15) and STRING (16) to construct TF-gene 
and protein-protein networks and the Gene Expression 
Atlas (17) to query expression in other data sets and 
Gene Set Enrichment Analysis (GSEA) (18) to associate 
a hst of genes with cellular processes. 

MATERIALS AND METHODS 

Data sources 

All data sources are summarized in Supplementary 
Table SI and a schematic diagram illustrating data sets 
and features of BloodChIP is shown in Figure 1 . 

TF ChlP-seq 

Genome-wide TF binding profiles for Human CD34+ 
HSPC (GSE45144), megakaryocytes (GSE24674), the 
acute myeloid leukemia (AML) cell line SKNO-1 
(GSE23730) and K562 (GSE24779, GSE29196, 
GSE31477) were obtained from the Gene Expression 
Omnibus (GEO) (9,12,19-21). To standardize the way 
TF binding sites are obtained from the data set, peak 
calling was performed using an improved automated 
pipeline that we used previously to analyse the Human 
CD34+ HSPC data set (9). Briefly, sequencing reads 
were aligned to the hgl9 reference genome using 
Burrows- Wheeler Ahgner (BWA) (22). Peak calhng was 
performed using three peak calhng algorithms, HOMER 
(23), MACS (24) and SPP (25). Only peaks that were 
called by two or more of the algorithms are kept as part 
of the final peak Hst. Aligned reads were also extended 
to 200 bp to generated TF-binding genome coverage 
profiles for visualization on the UCSC genome browser. 
A schematic diagram illustrating the pipehne is shown in 
Figure 2. 

To obtain a uniform set of TF binding sites across all 
TFs and across aU cefl fines, afi peaks were merged. This 
was achieved by sequentially merging overlapping peaks 
across pairs of data set until all data sets have been 
merged. Each region was then marked as either bound 
or unbound for each TF data set. TF binding peaks 
were mapped to nearby genes using the genomic regions 
enrichment of annotations tool (GREAT) (26), which 
defines genomic neighbourhoods for TF-bound peaks by 



assigning weights to flanking genes based on their distance 
to the peak. 

Histone ChlP-seq 

Histone ChlP-seq profiles from mobilized CD34+ ceUs 
and K562 cells were obtained from the NIH Roadmap 
Epigenomics Project (GSE 18927) (6) and the ENCODE 
project (GSE29611) (21), respectively. Histone profiles for 
H3K27AC, H3K4mel and H3K4me3 were analysed, and 
binding was quantified by counting the number to tags 
mapping to 1.5 kbs adjacent to the centre of each TF 
binding site. A read count of greater than 10 tags was 
used as evidence for the presence of a particular histone 
mark, which has been shown previously to adequately dis- 
tinguish true signal from noise (9). 

Gene expression 

Genome-wide microarray expression data for human 
normal CD34+ cells (GSE30029) (27), SKNO-1 cells 
(GSE34594) (28) and K562 (GSE28135) (29) were 
obtained from GEO (30), and expression data for 
megakaryocytes (E-TABM-633) (11) was obtained from 
ArrayExpress (31). Illuniina expression arrays were used 
in aU of the above studies, the non-normalized expression 
data from each data set was log2 transformed and quantile 
normalized across all samples together for all cell types. 
Where multiple microarray probes are available for one 
gene, the probe with the highest average expression value 
across all samples was selected to represent the expression 
of the gene. 

Genome-wide expression data comparing the stem and 
progenitor sub-fractions of human CD34+ ceUs of normal 
donors and AML patients was obtained from GEO 
(GSE24006) (14). Expression was measured on an 
Affymetrix array. Robust Multi-chip Average approach 
was used for normahzation and expression levels 
summarized. Probes were mapped to respective genes 
and expression values for each gene were stored in the 
database. 

DNasel hypersensitive site and digital genomic 
footprints 

For analysis of chromatin accessibility and digital genomic 
footprints, annotations for mobilized CD34+ cells and 
K562 cells (GSM646567) were obtained from previously 
pubhshed data (32). A TF bound region was annotated 
as chromatin accessible if the peak region overlapped 
with a DNase 1 hypersensitive site (HS) region. 
Furthermore, the number of digital genomic footprints 
was counted within each DNase I HS region. 

Database implementation, web interface and availability 

All processed data were stored in a local MySQL (version 
5.5.8) relational database. A web interface has been 
designed to provide user-friendly access for database 
query and visualization of TF binding and gene expression 
through PHP-based scripts. The web interface is ac- 
cessible at: http://www.med.unsw.edu.au/CRCWeb.nsf/ 
page/bloodChlP. BloodChIP also supports REST 
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Datasets 



Web-interface features 



TF ChlP-seq 



HistoneChlP-seq 



Expression microarray 



DNase l/DGF 




Search: 

Gene name/coordinates 
Transcription factor binding 
Differential Expression 



Data export: 

Peal<list 

Gene expression 

Cytoscope 

GSEA 



External links: 

UCSC Genome Browser 
Expression Atlas 
STRING 



Visualisation: 

Combinatorial binding 
chromatin state 
Gene expression 



Figure 1. Schematic summarizing data currently held by the BloodChIP database and features of the web interface. The database integrates TF 
ChlP-seq, histone ChlP-seq, DNase I/digital genomic footprinting (DGF) and expression microarray data. The web interface provides methods to 
query and visualize data and further links to external databases for further data analysis. 




Figure 2. Schematic illustrating the ChlP-seq data analysis pipeline used for processing of ChlP-seq data sets used to populate the BloodChIP 
database. The two outputs from the pipeline are the genome coverage profiles for visualization in UCSC and a list of transcription factor binding 
sites. 
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(REpresentational State Transfer) Web Services to enable 
programmatic access to the database. A tutorial is 
provided at the BloodChlP web page describing the use 
of the REST service. All data from BloodChlP are also 
available for download as SQL files. 

RESULTS: DATABASE FEATURES 

General web interface 

By default, the web interface to the database displays a 
main table that hsts all 23 804 genes in the human genome 
as a sortable table (Supplementary Figure SI). For each 
gene, the total number of heptad ChlP-seq peaks 
associated with the gene is shown. A link to the UCSC 
genome browser is also provided to allow visualization of 
tracks for TF ChlP-seq, histone profiles and DNasel HS 
at the respective gene locus. Clicking on the row of each 
gene will display the distribution of expression for a par- 
ticular gene in CD34+, megakaryocytes, SKNO-1 and 
K562 cells in the form of a box plot on the right of the 
main table. This will also display checkboxes that indicate 
TF binding associated with the particular gene. 

To explore TF ChlP-seq peaks associated with a particu- 
lar gene, the row containing the gene can be expanded. The 
genomic coordinates of each peak is shown, along with red 
boxes to indicate the binding of a particular TF from a 
particular cell line at the particular peak. A blue box indi- 
cates the absence of binding and a white box indicates that 
the TF-binding profile is not present for the particular cell 
type. The presence of each of the histone marks and DNase 
I hypersensitivity is also shown. Finally, the number of 
digital genomic footprints is also indicated. The binding 
status of a maximum of seven TFs in three cell types can 
be displayed on the web interface at any one time. The 
specific TFs and cell types displayed can be customized 
by the user using the display filter above the main table. 

Querying the database 

Two methods are available for querying the database 
using the web interface, namely using gene symbol/ 
genomic coordinates and TF binding. For queries using 
gene/genomic coordinate, a search dialogue is provided to 
accept searches using the gene symbol, Refseq ID or 
genomic coordinates in bed format. A user can either 
paste in the query Hst or upload this as a file. 

For retrieval of genes regulated by particular TF, a user 
can select specific TFs for the different cell types available 
in the database. A query can be constructed with any 
combination of TFs to identify either genes that are com- 
binatorially bound by multiple TFs ("AND") or bound by 
any of the TFs ("OR"). Furthermore, an option is avail- 
able to filter the retrieved genes based on differential gene 
expression between any of the cell types. A cut-off can be 
specified to retain only genes that are above or below a 
certain fold change threshold. 

Links to external database and tools 

To facilitate downstream analysis and data visualization, 
hyperlinks taking advantage of external APIs are 



provided. To further investigate the expression of a 
single gene in other experiments, a hnk is provided to 
the Gene Expression Atlas, displaying condition and 
experiment-specific expression patterns of the gene of 
interest. The Gene Expression Atlas is useful for the iden- 
tification of publically available gene expression data sets 
that can be used for further analysis and vahdation of 
hypotheses. 

To explore protein-protein interactions of the set of 
genes displayed in main table, a link is provided to the 
STRING database (16). The STRING database can be 
used to reveal known and predicted physical and func- 
tional protein interactions. This is particularly useful for 
the discovery of major protein-protein interaction hubs 
and for the prioritization of genes for further analysis. 

Exporting data 

The web interface provides four methods for the exporting 
of data from a queried set of genes. Firstly, TF binding 
information can be exported for individual genes in tab- 
delimited format. Secondly, the normalized expression 
values of all genes retrieved by a query for aU samples 
can be exported. This facilitates external analysis such as 
clustering of genes or the generation of heatmaps. Thirdly, 
the set of genes retrieved from the query can be exported 
as a gene set for use in GSEA. Finally, to allow the visu- 
alization of the TF-gene regulatory network, a file can be 
exported in a format that can be immediately visualized 
and analysed using the Cytoscape software package. 

Utility of BloodChlP: a working example 

If the user is interested in retrieving TF binding data at the 
RUNXl locus, a query for RUNXl is initiated by typing in 
the gene name or gene coordinates. The default settings 
retrieve all combinations of binding sites with one or more 
TF peaks that have been mapped to a locus by GREAT 
(26) (Supplementary Figure S2A). The resulting view 
shows peak coordinates in hCD34, Megakaryocytes and 
AML ceUs with a hnk to the UCSC browser and a 
checkerboard view of TF(s) bound to this region 
(Supplementary Figure S2B). The Chr21: 36398905- 
36399463 interval, which is bound by all seven TFs and 
has active chromatin marks, corresponds to the Rim.x]+23 
stem cell enhancer in mice (33). 

This view also permits easy visualization of comparative 
binding profiles at these or other regions in primary 
megakaryocytes (12) and AML cells (20). The gene 
expression view (Supplementary Figure S2C) to the right 
shows RUNXl expression across HSCs, multi-potent 
progenitors (MPP), common myeloid progenitors 
(CMP), granulocyte-monocyte progenitors (GMP) or 
megakaryocyte-erythroid progenitor (MEP) fractions as 
weU as in AML leukaemic stem cells (LSC; Lin-/ 
CD34+/38-/CD90-), AML leukaemic progenitor ceUs 
(Lin-/34+/38+) and AML blasts (Lin-/34-) (14), 
megakaryocytes and AML cells. Had the biological 
function of the +23 enhancer not been known, this 
region would have been the prime candidate for functional 
testing as a regulator of a gene that is both important for 
normal blood development and is mutated in leukaemia. 
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A tab at the top left corner permits easy export of data 
contained in this view. 

Alternatively, if the user wished to retrieve all targets for 
RUNXl alone or in combination with one or more TFs, 
the appropriate options corresponding to the particular 
cell type(s) of interest can be selected to yield a list of 
genes that can either be viewed on UCSC or exported to 
retrieve coordinates. For example, if the selects RUNXl 
(CD34) and FLU (CD34), the user will retrieve sites with 
combinatorial binding for RUNXl and FLU in CD34+ 
cells. If on the other hand RUNXl (CD34) or FLU 
(CD34) is chosen, the user will retrieve all RUNXl coord- 
inates and FLU coordinates in CD34+ cells. Protein- 
protein and TF-gene interactions for this list can also be 
visualized by following the adjacent tabs to STRING 
(Supplementary Figure S2E) and Cytoscape 
(Supplementary Figure S3 A). Data can also be exported 
into GSEA (Supplementary Figure S3B) to evaluate asso- 
ciations with cellular processes or for other applications 
such as generation of heatmaps using a tool of choice 
(Supplementary Figure S3C). Another feature of the 
database is the function to filter outputs based on differ- 
ential expression between normal HSCs and more 
differentiated normal blood subsets or normal HSCs and 
leukaemic stem cell fractions. Binding profiles and binding 
coordinates of each gene on the Ust can be accessed and 
compared between normal HSCs and leukaemic cell lines. 



DISCUSSION 

Combinatorial interactions of TFs are key determinants of 
cell identity (34). We have recently generated genome-wide 
high resolution binding profiles for seven key haematopoi- 
etic TFs in primary human CD34+ haematopoietic stem 
progenitor cells (HSPCs) (9). We have now integrated 
combinatorial TF binding data with quantitative gene 
expression, histone modification and digital genomic foot- 
printing data in these cells from the Human Epigenome 
Atlas (6) and ENCODE (10) and created a user-friendly 
database that allows users to (i) Interrogate overlapping 
TF binding at a locus of interest, (ii) Establish binding 
coordinates and associated gene targets for TFs of 
interest, (iii) Compare binding profiles between cell types 
and (iv) Filter Usts based on differentially expressed genes. 
Additional features include tools to construct visual maps 
of TF-gene and protein-protein interaction networks 
using links to Cytoscape (35) and STRING (16), respect- 
ively. The overall aim of this database is to facilitate 
enquiry into dynamic changes to the transcriptional 
network of HSPCs as cells differentiate into specific 
lineages or transform into leukaemic cells. 

At present, we have included data sets for normal 
hCD34s, primary megakaryocytes and an AML cell line 
for which high-quahty combinatorial TF binding data are 
available. As more TF binding data are pubhshed and 
where the quality of these data sets accord with 
ENCODE guidehnes (36), BloodChIP will be expanded 
to accommodate more TF binding profiles in human 
CD34+ HSPCs as well as those in other normal human 
blood lineages and human leukaemic cell lines. We beheve 



that BloodChIP, which integrates combinatorial TF 
binding with gene expression in normal and maUgnant 
cells, will be of particular interest to biologists and 
bioinformaticians working in the fields of haematopoiesis 
and leukaemia and also more generally to those working 
on gene regulation and stem cell biology. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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